You can use the Code button to hide all code if you just want to focus on the analysis.

Given that we cannot access the data and I was under the impression
that our coding walk-through today was somewhat helpful, I attempted to
create a template for the analysis that you can use for the data. I was
not able to access hospitalizations at the county level
from the COVID-19 DataHub, so I
will be using deaths in lieu of
hospitalizations for this walk-through.
This template makes the following assumptions:
The COVID-19 DataHub provides the data at three units of
aggregation: country, state, and
county. From my understanding, you are modeling data for
Greater St. Louis (i.e., St. Louis MSA). To get the data similar to what
Peter has access to, I summed both cases and deaths for counties in both
Illinois (‘Bond’, ‘Calhoun’, ‘Clinton’, ‘Jersey’, ‘Macoupin’, ‘Madison’,
‘Monroe’) and Missouri (‘Crawford’, ‘Franklin’, ‘Jefferson’, ‘Lincoln’,
‘St. Charles’, ‘St. Clair’, ‘St. Louis’, ‘St. Louis City’, ‘Warren’)
based on the Greater St. Louis
Wikipedia Page.
Off the top of my head, I did not remember the study period that you
had for the analysis, so I have used data starting from
Jan 04, 2021 (the first Monday of the year) for this
analysis.
For the St. Louis Health Department, I assumed that the
non-weekend holidays matched those of the New York Stock Exchange. These
were extracted using
tk_make_holiday_sequence(start_date = '2021-01-04', end_date = '2022-08-01', calendar = 'NYSE')
from the timetk
package. The exact holiday dates from this function call are:
2021-01-18, 2021-02-15, 2021-04-02, 2021-05-31, 2021-07-05, 2021-09-06,
2021-11-25, 2021-12-24, 2022-01-17, 2022-02-21, 2022-04-15, 2022-05-30,
2022-07-04.
I kind of
eyeballed the dates for the dominant COVID-19 variant based
on the following
NY Times Article. I have used the plot within the article to define
the dominant variant to be as follows:
dominant_variant = case_when(date < ymd('2021-03-01') ~ 'Epsilon', date < ymd('2021-06-15') ~ 'Alpha', date < ymd('2021-12-15') ~ 'Delta', date >= ymd('2021-12-15') ~ 'Omicron') %>% as_factor().
This allowed me to take into account the ‘entire’ time-series rather
than just the Omicron part.
I assumed that one of the end goals of your analysis would be forecasting future hospitalizations. So far, in our meetings, we discussed explanatory modeling. So I divided my data into a training and a holdout sample, primarily for the purpose of showing you how it's done. Obviously, you can override this by ignoring the holdout and just focusing on the training results (which you can also set to be the entire time period of interest).
I did not lag the confirmed cases. I believe you should test out different lags within the context of both auto.arima() and the other models, since we would be hypothesizing that lagged confirmed cases lead to future hospitalizations. It is my understanding that auto.arima() does not account for lagged relationships in the exogenous variables; see the Lagged Predictors section in the fpp2 book. The implications of introducing lagged predictors on the current code are as follows:
confirmed1 = lag(confirmed, n = 1) creates a one-day lag; changing
n would change the lag. I used initial_time_split() to ensure a separation between the
training and testing datasets. In that function, the default value for
lag = 0, and you would need to change it to the highest lag
you are evaluating. I also ignored potentially relevant predictors such as the
stringency index and vaccines. In my opinion,
these can be important with the different waves. This would be nice to explore as you
continue to build the model.
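Going back to the lagged predictors mentioned above, the mechanics can be sketched in base R (the helper name, toy series, and values below are hypothetical; in the tidyverse pipeline you would use mutate(confirmed1 = lag(confirmed, n = 1)) instead):

```r
# Hypothetical sketch: building a lagged copy of the confirmed-cases series
make_lag = function(x, k = 1) {
  # shift the series forward by k positions, padding the start with NA
  c(rep(NA_real_, k), head(x, length(x) - k))
}

confirmed = c(10, 12, 15, 20, 18)      # toy daily counts
confirmed1 = make_lag(confirmed, k = 1)
confirmed1
# [1] NA 10 12 15 20

# The first k rows are NA, which is why the lag argument of
# initial_time_split() (default 0) should be set to the highest
# lag you are evaluating before splitting the data.
```
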
How should we handle negative counts in
cases/deaths/hospitalizations? See the death count on
March 29, 2021 in the plot in Section 5.1 as an
example.
Should we worry about the residuals’ departure from
normality? See the results of the
Shapiro-Wilk Test in Section 7.2.2.
Do you agree with my comment about lagged predictors? See the second to last assumption in the Preface section.
The code chunk below contains the packages used in this analysis:
# tidyverse pkg for data manipulation
if(!require(tidyverse)) {install.packages('tidyverse'); library(tidyverse)}
# readr for reading files
if(!require(readr)) {install.packages('readr'); library(readr)}
# timetk for tidy time-series preprocessing
if(!require(timetk)) {install.packages('timetk'); library(timetk)}
# tidyquant for tidy time-series analysis
if(!require(tidyquant)) {install.packages('tidyquant'); library(tidyquant)}
# modeltime for tidy forecasting
if(!require(modeltime)) {install.packages('modeltime'); library(modeltime)}
# for interactive plots
if(!require(plotly)) {install.packages('plotly'); library(plotly)}
# for tidy machine learning
if(!require(tidymodels)) {install.packages('tidymodels'); library(tidymodels)}
# for the xgboost model
if(!require(xgboost)) {install.packages('xgboost'); library(xgboost)}

Per the Preface, I read the data used in this analysis from the COVID-19
DataHub. To avoid requiring the installation of the COVID19
package, I have downloaded the zip file and unzipped it using R. Then, I
filtered the data to Missouri and Illinois, selected the appropriate
counties, and computed the variables that we are interested in.
# creating a temp file for downloading the data
temp = tempfile()
# download the file to temporary location
download.file("https://storage.covid19datahub.io/country/USA.csv.zip", temp)
# unzip and read the file
covid_tbl = unz(temp, "USA.csv") %>%
# reading the data from the CSV
read_csv() %>%
# filtering to Missouri and Illinois
filter(administrative_area_level_2 %in% c('Illinois', 'Missouri') )
st_louis_tbl = covid_tbl %>%
# Relevant Counties in Illinois
filter(
(administrative_area_level_2 == 'Illinois' &
administrative_area_level_3 %in% c('Bond', 'Calhoun', 'Clinton', 'Jersey', 'Macoupin', 'Madison', 'Monroe') ) |
# Relevant Counties in Missouri
(administrative_area_level_2 == 'Missouri' &
administrative_area_level_3 %in% c('Crawford', 'Franklin', 'Jefferson', 'Lincoln', 'St. Charles', 'St. Clair', 'St. Louis', 'St. Louis City', 'Warren'))
)
# Aggregating the counts by day (so that we have an approximation of total numbers for st. louis)
st_louis_agg_tbl =
st_louis_tbl %>%
# grouping by date so we can create an aggregated summation across all counties
group_by(date) %>%
select(date,
# variables to be summed across all counties in St. Louis
confirmed, deaths, recovered, hosp, icu, vent,
# population will be summed only to see if the aggregation is reasonable
# i.e., is the total pop close to the actual for the St. Louis MSA
population,
# other variables that can be included in the analysis later
stringency_index) %>%
summarise_at(
vars(confirmed, deaths, recovered, hosp, icu, vent, population),
.funs = sum, na.rm = T
) %>%
ungroup()
# remove temp file
unlink(temp)
# glimpse of the data
glimpse(st_louis_agg_tbl)

## Rows: 878
## Columns: 8
## $ date <date> 2020-03-07, 2020-03-08, 2020-03-09, 2020-03-10, 2020-03-11…
## $ confirmed <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 7, 12, 20, 26, 40, 77, 94, 99…
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3,…
## $ recovered <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ hosp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ icu <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ vent <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ population <dbl> 994205, 994205, 994205, 994205, 994205, 994205, 994205, 994…
From the printout of the data, we have 8 variables and 878 days of
data. Note that the dominant_variant was eyeballed as
mentioned previously and the holidays variable assumes that
your health department matches the holidays taken by the NYSE.
From the plot below, it is clear that this public dataset does not contain information on hospitalizations, vent, and ICU. Thus, I will model deaths as a function of cases. Peter, you can change the variables based on your dataset.
It is also clear that the data are cumulative, so we will difference them to obtain the daily new cases and deaths.
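As a quick base R illustration of that differencing step (toy numbers, not the actual counts):

```r
# Toy cumulative counts standing in for the DataHub series
cumulative = c(1, 1, 2, 5, 7)

# diff() returns a series one element shorter than its input;
# the pipeline below instead uses x - lag(x), which keeps an NA
# first row that is then removed with drop_na()
daily_new = diff(cumulative)
daily_new
# [1] 0 1 3 2
```
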
st_louis_agg_tbl %>%
pivot_longer(cols = c(confirmed, deaths, recovered, hosp, icu, vent, population),
names_to = 'statistic') %>%
ggplot(
aes(x = date, y = value)
) +
geom_line() +
facet_wrap(~ statistic, scales = 'free_y', ncol = 1) +
theme_bw() +
scale_x_date(breaks = scales::pretty_breaks(n = 12)) +
scale_y_continuous(labels = scales::comma) +
labs(title = 'Plots of the Time-Series of Potential Variables for St. Louis MSA',
subtitle = 'Hosp, ICU, Recovered and Vent have no data',
caption = 'Based on data aggregated from the COVID19 DataHub',
x = 'Date',
y = NULL)-> p1
ggplotly(p1)

Based on our exploratory analysis of the data, there seemed to be
some large outliers at the beginning of 2021. So I am starting the
analysis from the first Monday of the year. I differenced to compute the
variables of interest. The code below plots the two time-series of
interest (confirmed cases and
confirmed deaths), color coded based on the COVID-19
dominant variant.
st_louis_agg_tbl =
st_louis_agg_tbl %>%
# keeping only relevant columns for forecasting
select(date, confirmed, deaths) %>%
# converting the cumulative counts to new
mutate(
confirmed = confirmed - lag(confirmed),
deaths = deaths - lag(deaths)
) %>%
filter(date >= ymd('2021-01-04')) %>%
mutate(
# creating a data frame of variants based on
# https://www.nytimes.com/interactive/2021/health/coronavirus-variant-tracker.html
dominant_variant =
case_when(
date < ymd('2021-03-01') ~ 'Epsilon',
date < ymd('2021-06-15') ~ 'Alpha',
date < ymd('2021-12-15') ~ 'Delta',
date >= ymd('2021-12-15') ~ 'Omicron'
) %>% as_factor(),
# creating a list of special holidays
holidays =
if_else(date %in% tk_make_holiday_sequence(start_date = min(date),
end_date = max(date),
calendar = 'NYSE'),
true = 'yes',
false = 'no') %>% as_factor()
) %>%
drop_na()
st_louis_agg_tbl %>%
pivot_longer(cols = c(confirmed, deaths),
names_to = 'statistic') %>%
ggplot(
aes(x = date, y = value, color = dominant_variant)
) +
geom_line() +
facet_wrap(~ statistic, scales = 'free_y', ncol = 1) +
theme_bw() +
scale_x_date(breaks = scales::pretty_breaks(n = 12)) +
scale_y_continuous(labels = scales::comma) +
scale_color_brewer(palette = 'Paired') +
labs(x = 'Date',
y = 'New Daily Counts',
title = 'Plots of Confirmed Cases and Deaths for St. Louis MSA') -> p2
ggplotly(p2)

An important question to address (if this is in Peter's data) is how
to handle negative counts. A naive fix in R would be
to take the maximum of each value and 0 (see pmax()).
However, this ignores the misreporting of the data.
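A minimal sketch of that naive fix (the series below is made up purely for illustration):

```r
# Hypothetical differenced death counts containing a negative revision
deaths_daily = c(5, 3, -2, 4, 0)

# pmax() clamps each element at zero; note this silently discards the
# information that a downward revision occurred on that day
deaths_clamped = pmax(deaths_daily, 0)
deaths_clamped
# [1] 5 3 0 4 0
```
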
Below, I have capitalized on the initial_time_split() to
split the time-series such that the first 90% of the data are used for
model building and the remaining 10% are used for validation. We
can redo this if we solely want to focus on the explanatory modeling
component.
splits = initial_time_split(st_louis_agg_tbl, prop = 0.9)
# Printing out the splits so that we know the number of observations used for model training
splits

## <Training/Testing/Total>
## <517/58/575>
# Training data start and end dates
paste('The starting and ending dates for training are',
splits$data[splits$in_id, 'date'] %>% head(n=1) %>% pull(),
'and',
splits$data[splits$in_id, 'date'] %>% tail(n=1) %>% pull(),
'respectively. For the holdout data, the starting and ending dates are',
splits$data[splits$out_id, 'date'] %>% head(n=1) %>% pull(),
'and',
splits$data[splits$out_id, 'date'] %>% tail(n=1) %>% pull())

## [1] "The starting and ending dates for training are 2021-01-04 and 2022-06-04 respectively. For the holdout data, the starting and ending dates are 2022-06-05 and 2022-08-01"
In this section, I have quickly trained the following five models: a univariate auto ARIMA; an auto ARIMA with exogenous regressors; an auto ARIMA with XGBoost modeling of the errors; a prophet model; and a linear regression with the predictors as.numeric(date) + confirmed + holidays + dominant_variant. I have capitalized on the modeltime
package vignette to quickly run these models. Note that the
modeltime API also allows for running other machine
learning models for time-series applications.
# a basic univariate ARIMA model using “Auto Arima” using arima_reg()
# using the modeltime pkg this will automatically pick the weekly seasonality
model_fit_arima = arima_reg() %>%
set_engine(engine = "auto_arima") %>%
fit(deaths ~ date, data = training(splits) )

## frequency = 7 observations per 1 week
# ARIMA with xreg
model_fit_arima_xreg = arima_reg() %>%
set_engine(engine = "auto_arima") %>%
# confirmed, holidays and dominant variant as our xreg
fit(
deaths ~ date + confirmed + holidays + dominant_variant,
data = training(splits)
)

## frequency = 7 observations per 1 week
# Auto Arima Model with xgboost on the Arima errors
model_fit_arima_boosted = arima_boost(
min_n = 2,
learn_rate = 0.015
) %>%
set_engine(engine = "auto_arima_xgboost") %>%
fit(deaths ~ date + confirmed + holidays + dominant_variant,
data = training(splits) )

## frequency = 7 observations per 1 week
# the prophet model used by Facebook
model_fit_prophet = prophet_reg() %>%
set_engine(engine = "prophet") %>%
fit(deaths ~ date + confirmed + holidays + dominant_variant,
data = training(splits) )

## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# Linear Regression
model_fit_lm = linear_reg() %>%
set_engine("lm") %>%
fit(deaths ~ as.numeric(date) + confirmed + holidays + dominant_variant,
data = training(splits) )
models_tbl = modeltime_table(
model_fit_arima,
model_fit_arima_xreg,
model_fit_arima_boosted,
model_fit_prophet,
model_fit_lm
)
models_tbl

## # Modeltime Table
## # A tibble: 5 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> ARIMA(2,1,3)(2,0,0)[7]
## 2 2 <fit[+]> REGRESSION WITH ARIMA(1,0,0)(1,0,0)[7] ERRORS
## 3 3 <fit[+]> ARIMA(5,1,1)(0,0,2)[7] W/ XGBOOST ERRORS
## 4 4 <fit[+]> PROPHET W/ REGRESSORS
## 5 5 <fit[+]> LM
If this data is close to your data, you can see that the first non-seasonal difference was taken in both auto.arima() applications. This answers the question from Allison today.
The code below is used to plot the residuals for each of the models. The first three lines of code are very similar to the code shown in the package vignette. The additional code is to allow you to customize the plot to your liking.
models_tbl %>%
modeltime_calibrate(new_data = training(splits)) %>%
modeltime_residuals() %>%
plot_modeltime_residuals(.interactive = FALSE,
.type = 'timeplot') +
scale_x_date(breaks = scales::pretty_breaks(12)) +
scale_color_brewer(palette = 'Dark2') +
facet_wrap(~ .model_desc, ncol = 1, scales = 'free') +
theme_bw() +
theme(legend.position = 'none') +
labs(
title = 'Residuals plot for the five models based on our training data',
subtitle = 'The residuals are large on the same day irrespective of model')

The modeltime_residuals_test() computes the results from
4 different statistical tests:
The Shapiro-Wilk Test tests the normality of the residuals. The null hypothesis is that the residuals are normally distributed; a p-value below a given significance level indicates the residuals are NOT normally distributed.
Both the Box-Pierce and Ljung-Box Tests test for the absence of autocorrelation in the residuals; a p-value below a given significance level indicates the residuals are autocorrelated.
The Durbin-Watson statistic also checks for autocorrelation in the residuals; values near 2 suggest no first-order autocorrelation.
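If you want to run these tests on a single residual vector outside of modeltime, base R offers direct equivalents for three of them (the residuals below are simulated purely for illustration; use a fitted model's actual residuals instead):

```r
set.seed(42)
res = rnorm(100)   # placeholder residuals for illustration only

# Shapiro-Wilk test of normality
shapiro.test(res)$p.value

# Box-Pierce and Ljung-Box tests for residual autocorrelation
Box.test(res, lag = 10, type = "Box-Pierce")$p.value
Box.test(res, lag = 10, type = "Ljung-Box")$p.value

# The Durbin-Watson test requires an additional package
# (e.g., lmtest::dwtest() or car::durbinWatsonTest())
```
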
models_tbl %>%
modeltime_calibrate(new_data = training(splits)) %>%
modeltime_residuals() %>%
modeltime_residuals_test()

## # A tibble: 5 × 6
## .model_id .model_desc shapiro_wilk box_pierce ljung_box durbin_watson
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 ARIMA(2,1,3)(2,0,0)… 1.59e-29 0.880 0.880 2.01
## 2 2 REGRESSION WITH ARI… 4.54e-31 0.957 0.957 1.99
## 3 3 ARIMA(5,1,1)(0,0,2)… 2.34e-28 0.786 0.786 2.02
## 4 4 PROPHET W/ REGRESSO… 4.34e-30 0.0586 0.0579 1.83
## 5 5 LM 1.07e-30 0.00212 0.00205 1.73
# If you want to explore the residuals separately, they are stored as follows
training_residuals = models_tbl %>%
modeltime_calibrate(new_data = training(splits)) %>%
modeltime_residuals()

The results above indicate that the residuals are not normally distributed. However, they also indicate no significant autocorrelation in the residuals, as shown by both the Box-Pierce and Ljung-Box test results.
Multiple accuracy measures can be computed as follows.
models_tbl %>%
modeltime_calibrate(new_data = training(splits)) %>%
modeltime_accuracy() %>%
table_modeltime_accuracy()

Note that MAPE is infinite in the table below since there are several
days where the count of new deaths = 0. The MASE indicates
that all models improve over the naive forecast by about 25%. However,
the rsq for all models is low. Maybe we need to add additional
predictors.
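Since zero-death days make MAPE undefined, one alternative worth considering is sMAPE, which stays finite whenever the truth and the estimate are not both zero. A hand-rolled sketch with made-up numbers follows (yardstick::smape_vec() is the tidymodels equivalent):

```r
# symmetric MAPE: divides by the average magnitude of truth and estimate
# (returns NaN if truth and estimate are both zero on the same day)
smape_sketch = function(truth, estimate) {
  100 * mean(2 * abs(truth - estimate) / (abs(truth) + abs(estimate)))
}

truth    = c(0, 3, 5, 0, 2)   # hypothetical daily death counts (note the zeros)
estimate = c(1, 2, 5, 1, 2)
smape_sketch(truth, estimate)
# [1] 88
```
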
The name of the function from the modeltime package is
somewhat confusing. It is called modeltime_calibrate(), but calibrating
does not refit the models; it adds a new column, .calibration_data,
which stores the test-set predictions and residuals.
calibration_tbl = models_tbl %>%
modeltime_calibrate( new_data = testing(splits) )
calibration_tbl

## # Modeltime Table
## # A tibble: 5 × 5
## .model_id .model .model_desc .type .calibration_da…
## <int> <list> <chr> <chr> <list>
## 1 1 <fit[+]> ARIMA(2,1,3)(2,0,0)[7] Test <tibble>
## 2 2 <fit[+]> REGRESSION WITH ARIMA(1,0,0)(1,0,0)… Test <tibble>
## 3 3 <fit[+]> ARIMA(5,1,1)(0,0,2)[7] W/ XGBOOST E… Test <tibble>
## 4 4 <fit[+]> PROPHET W/ REGRESSORS Test <tibble>
## 5 5 <fit[+]> LM Test <tibble>
Per the description above the code chunk, the model descriptions did not change and we only appended the testing data.
models_tbl %>%
modeltime_calibrate(new_data = testing(splits)) %>%
modeltime_residuals() %>%
plot_modeltime_residuals(.interactive = FALSE,
.type = 'timeplot') +
scale_x_date(breaks = scales::pretty_breaks(12)) +
scale_color_brewer(palette = 'Dark2') +
facet_wrap(~ .model_desc, ncol = 1) +
theme_bw() +
theme(legend.position = 'none') +
labs(
title = 'Residuals plot for the five models based on our holdout data',
subtitle = 'The ARIMA type models are most accurate. LM is over-predicting since residuals are mostly negative')

models_tbl %>%
modeltime_calibrate(new_data = testing(splits)) %>%
modeltime_residuals() %>%
modeltime_residuals_test()

## # A tibble: 5 × 6
## .model_id .model_desc shapiro_wilk box_pierce ljung_box durbin_watson
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 ARIMA(2,1,3)(2,0,0)… 0.00000149 0.215 0.203 1.97
## 2 2 REGRESSION WITH ARI… 0.000156 0.0494 0.0438 0.432
## 3 3 ARIMA(5,1,1)(0,0,2)… 0.000000585 0.475 0.464 2.10
## 4 4 PROPHET W/ REGRESSO… 0.00680 0.148 0.138 1.59
## 5 5 LM 0.0000257 0.215 0.204 0.382
The holdout results are consistent with the observations made on the training data. One exception is that the ARIMA with xreg (model 2) shows correlated residuals on the holdout dataset.
Note that MAPE is infinite in the table below since there are several
days where the count of new deaths = 0.
models_tbl %>%
modeltime_calibrate(new_data = testing(splits)) %>%
modeltime_accuracy() %>%
table_modeltime_accuracy()

The first and third models continue to be the best models. However, their improvement over the naive forecast (i.e., random walk) is smaller than on the training dataset.
Email: fmegahed@miamioh.edu | Phone: +1-513-529-4185 | Website: Miami University Official